Critical Survey of the Freely Available Arabic Corpora
نویسنده
چکیده
The availability of corpora is a major factor in building natural language processing applications. However, the costs of acquiring corpora can prevent some researchers from going further in their endeavours. The ease of access to freely available corpora is urgent needed in the NLP research community especially for language such as Arabic. Currently, there is not easy was to access to a comprehensive and updated list of freely available Arabic corpora. We present in this paper, the results of a recent survey conducted to identify the list of the freely available Arabic corpora and language resources. Our preliminary results showed an initial list of 66 sources. We presents our findings in the various categories studied and we provided the direct links to get the data when possible.
منابع مشابه
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
متن کاملThe design of a corpus of Contemporary Arabic
Corpora are an important resource for both teaching and research. Arabic lacks sufficient resources in this field, so a research project has been designed to compile a corpus, which represents the state of the Arabic language at the present time and the needs of end-users. This report presents the result of a survey of the needs of teachers of Arabic as a foreign language (TAFL) and language en...
متن کاملProposed Framework for the Evaluation of Standalone Corpora Processing Systems: An Application to Arabic Corpora
Despite the accessibility of numerous online corpora, students and researchers engaged in the fields of Natural Language Processing (NLP), corpus linguistics, and language learning and teaching may encounter situations in which they need to develop their own corpora. Several commercial and free standalone corpora processing systems are available to process such corpora. In this study, we first ...
متن کاملFree text corpora and their application to community retrieval evaluation
This technical report surveys text data sets that are freely available for the evaluation of classification and retrieval systems. Based on this survey, a method is proposed to re-interpret the features of relational corpora to obtain a data set for community-oriented tasks, answering the lack of testing data available.
متن کاملANERsys 2.0: Conquering the NER Task for the Arabic Language by Combining the Maximum Entropy with POS-tag Information
In this paper we describe an improved version of ANERsys, an Arabic Named Entity Recognition system for open-domain texts. The first version of ANERsys was totally based on the Maximum Entropy approach and was trained and tested with corpora which we have built ourselves. The results showed that the Maximum Entropy is an appropriate method to identify Named Entities in Arabic texts. However, in...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره abs/1702.07835 شماره
صفحات -
تاریخ انتشار 2017